Day 6: What Are LLMs? The Principles of Large Language Models
An LLM (Large Language Model) is not simply a “big model.” Once it surpasses a certain scale, abilities that were impossible in smaller models suddenly emerge. This is what makes LLMs special.
The Core of LLMs: Next Token Prediction
The training objective of LLMs is surprisingly simple. Predicting the next token is all there is to it.
```python
import math

# Pre-training of LLMs boils down to this:
# Input:  "The weather today is really"
# Target: "nice"
# By repeating this task over trillions of tokens, the model learns
# grammar, facts, and even reasoning abilities.

def cross_entropy(predicted, target):
    """Negative log-probability assigned to the true token.
    Here `predicted` is a dict mapping token -> probability,
    a stand-in for the model's softmax output."""
    return -math.log(predicted.get(target, 1e-12))

def next_token_prediction_loss(model, text_tokens):
    """The core loss function of pre-training."""
    total_loss = 0.0
    for i in range(1, len(text_tokens)):
        context = text_tokens[:i]   # Previous tokens
        target = text_tokens[i]     # Next token (ground truth)
        predicted = model(context)  # Model's predicted distribution
        total_loss += cross_entropy(predicted, target)
    return total_loss / (len(text_tokens) - 1)
```
Scaling Laws
Optimal training rules revealed by the Chinchilla paper (2022):
| Parameters | Optimal Tokens | Example Model |
|---|---|---|
| 1B | 20B tokens | TinyLlama class |
| 7B | 140B tokens | Llama 2 7B |
| 13B | 260B tokens | Llama 2 13B |
| 70B | 1.4T tokens | Llama 2 70B |
```python
# Chinchilla rule of thumb: optimal tokens ~ 20 x parameter count
def optimal_tokens(num_params_billions):
    return num_params_billions * 20  # Unit: billions (B) of tokens

# Example
for size in [1, 7, 13, 70]:
    tokens = optimal_tokens(size)
    print(f"{size}B model -> optimal {tokens}B tokens for training")
```
However, recent models often exceed this rule. Llama 3 8B was trained on 15T tokens, roughly 100 times the Chinchilla optimal.
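The gap can be made concrete with a quick back-of-the-envelope calculation, using the figures above (8B parameters, 15T training tokens):

```python
# Rough sketch: how far Llama 3 8B exceeds its Chinchilla-optimal budget.
# Figures come from the text above; all token counts are in billions.
params_b = 8                           # parameters, in billions
actual_tokens_b = 15_000               # 15T tokens = 15,000B
chinchilla_optimal_b = params_b * 20   # Chinchilla rule: ~20 tokens per parameter
ratio = actual_tokens_b / chinchilla_optimal_b
print(f"Chinchilla optimum: {chinchilla_optimal_b}B tokens")
print(f"Actual: {actual_tokens_b}B tokens -> {ratio:.0f}x the optimum")
```

The extra training past the Chinchilla optimum trades compute efficiency for a smaller, cheaper-to-serve model at a given quality level.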
Characteristics by Parameter Scale
```python
model_capabilities = {
    "Under 1B": {
        "capable": ["Simple classification", "Sentiment analysis", "Keyword extraction"],
        "difficult": ["Complex reasoning", "Long text generation", "Code generation"],
        "examples": "DistilBERT, TinyLlama",
    },
    "7B - 13B": {
        "capable": ["General conversation", "Basic coding", "Summarization", "Translation"],
        "difficult": ["Mathematical reasoning", "Complex analysis"],
        "examples": "Llama 3 8B, Mistral 7B",
    },
    "30B - 70B": {
        "capable": ["Complex reasoning", "Advanced coding", "Long document analysis"],
        "difficult": ["Top-tier mathematics", "Specialized medical/legal"],
        "examples": "Llama 3 70B, Mixtral 8x7B",
    },
    "Over 100B": {
        "capable": ["Advanced reasoning", "Multi-turn conversation", "Creative writing"],
        "note": "Emergent abilities fully manifest",
        "examples": "Latest top-tier GPT models, latest Claude Opus",
    },
}

for size, info in model_capabilities.items():
    print(f"\n{'=' * 40}")
    print(f"Scale: {size}")
    print(f"Capable of: {', '.join(info['capable'])}")
    print(f"Examples: {info['examples']}")
```
Emergent Abilities
```python
# Emergent abilities: capabilities that appear abruptly once model size
# crosses a threshold -- success rate is near 0% in small models,
# then spikes at a certain scale.
emergent_abilities = {
    "Chain-of-Thought reasoning": "Reaching correct answers through step-by-step thought processes",
    "Few-shot learning": "Performing new tasks after seeing just a few examples",
    "Code generation": "Converting natural language descriptions into code",
    "Multilingual translation": "Translating language pairs not seen during training",
    "Arithmetic reasoning": "Solving multi-step math problems",
}

print("Emergent abilities of LLMs:")
for ability, description in emergent_abilities.items():
    print(f"  - {ability}: {description}")

# Note: recent research argues that "emergent abilities" may be an
# artifact of the measurement method: the sharp transition disappears
# when evaluation metrics are made continuous.
```
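One way to build intuition for the measurement-artifact argument is a toy calculation (my own sketch, not a result from the research itself): if per-token accuracy improves smoothly with scale, a metric that demands an exact match over a multi-token answer can still look like a sudden jump.

```python
# Toy sketch (assumptions: smooth per-token accuracy p, tokens scored
# independently). Exact-match on a 10-token answer is p**10: it stays
# near 0 until p is already high, then "emerges" -- even though the
# underlying per-token accuracy grows gradually.
k = 10  # answer length in tokens (hypothetical)
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(f"per-token accuracy {p:.2f} -> exact-match {p ** k:.3f}")
```

Under this view, the continuous metric (per-token accuracy) shows no discontinuity; only the all-or-nothing metric does.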
The power of LLMs comes from performing a simple objective (next token prediction) at an extreme scale. Starting tomorrow, we’ll examine actual LLM models one by one.
Today’s Exercises
- Intuitively explain how the ability to answer questions can emerge from “just predicting the next token.” (Hint: the training data includes Q&A-format text)
- According to the Chinchilla law, calculate the optimal number of training tokens for a 3B parameter model, and research how many tokens Phi-3 (3.8B) was actually trained on.
- Research the argument that emergent abilities are “measurement artifacts” and form your own opinion.